This project analyzes trip data from the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables.
The data comes from Kaggle and the New York TLC; each year, taxi companies store a large amount of customer information that can be analyzed to predict trends and generate insights.
The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
The data processing stages, including the important steps, are listed in detail below:
Fare-amount analysis: analyze the revenue from each trip based on the Fare_Amount and Trip_Duration features.
Spatial data analysis: analyze the density of each area on a heatmap built from the spatial data.
Trip-duration analysis: based on pick-up and drop-off locations and trip distances, build a time-series view (by hour, day, month, and year) of New York taxi user behavior in terms of fare amount and trip duration.
I have invested a lot of time and effort in this project as a Data Analyst, and drew on many different resources to keep the EDA as clean as possible, since I will upload it to GitHub and share it with everyone.

import holoviews as hv
import geoviews as gv
import param, paramnb, parambokeh
import pandas as pd
import dask.dataframe as dd
## The packages below are used for the spatial data.
from colorcet import cm
from bokeh.models import WMTSTileSource
from holoviews.operation.datashader import datashade
from holoviews.streams import RangeXY, PlotSize
import ipywidgets as widgets
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
# import lightgbm as lgbm .
import io
import os
import gc
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# from jupyterthemes import jtplot
# jtplot.style(context='talk', fscale=1.4, spines=False)
from scipy import stats
from sklearn import cluster
plt.rcParams['font.size'] = 12
# plt.rcParams['axes.grid'] = False
# Center all graph output.
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
I split the same data into three parts:
df_train5k: 5,000 rows,
df_train: 50,000 rows,
train: 500,000 rows.
Each subset is used in a different graph, because the full dataset is heavy and takes a long time for the computer to load; the smaller samples keep the notebook responsive. With a more powerful machine, the same code could load more than 1 million rows.
data = pd.read_csv(filepath_or_buffer='train.csv', engine='c', infer_datetime_format=True, parse_dates=[2, 3])
# df_train with a limited number of rows
df_train = pd.read_csv('Fare-Predictions/train.csv', nrows=50000)
df_train5k = pd.read_csv('Fare-Predictions/train.csv', nrows=5000)
# 5,000-row sample.
train = pd.read_csv('traintrip.csv', nrows=500000)
# Loading the full train file at once can slow the machine down badly.
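Since memory is the constraint here, one alternative to loading fixed-size samples is streaming the file in chunks and downcasting dtypes as you go. A minimal sketch of the idea: the DTYPES mapping and the tiny in-memory CSV below are illustrative assumptions, not the real train.csv schema.

```python
import io
import pandas as pd

# Hypothetical dtype map; adjust to the actual columns of train.csv.
DTYPES = {"vendor_id": "uint8", "passenger_count": "uint8"}

def load_in_chunks(path_or_buf, chunksize=100_000):
    """Stream the CSV so only one chunk is in memory at a time."""
    parts = []
    for chunk in pd.read_csv(path_or_buf, dtype=DTYPES,
                             parse_dates=["pickup_datetime"],
                             chunksize=chunksize):
        # Drop obviously invalid rows before accumulating.
        parts.append(chunk[chunk["passenger_count"] > 0])
    return pd.concat(parts, ignore_index=True)

# Tiny stand-in for the real file.
csv = io.StringIO(
    "vendor_id,pickup_datetime,passenger_count\n"
    "1,2016-03-14 17:24:55,1\n"
    "2,2016-06-12 00:43:35,0\n"
)
small = load_in_chunks(csv, chunksize=1)
```

Filtering per chunk means rows with zero passengers never accumulate in memory, which is the main point of chunked loading.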
train.head(10)
| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.010040 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.782520 | N | 435 |
| 5 | id0801584 | 2 | 2016-01-30 22:01:40 | 2016-01-30 22:09:03 | 6 | -73.982857 | 40.742195 | -73.992081 | 40.749184 | N | 443 |
| 6 | id1813257 | 1 | 2016-06-17 22:34:59 | 2016-06-17 22:40:40 | 4 | -73.969017 | 40.757839 | -73.957405 | 40.765896 | N | 341 |
| 7 | id1324603 | 2 | 2016-05-21 07:54:58 | 2016-05-21 08:20:49 | 1 | -73.969276 | 40.797779 | -73.922470 | 40.760559 | N | 1551 |
| 8 | id1301050 | 1 | 2016-05-27 23:12:23 | 2016-05-27 23:16:38 | 1 | -73.999481 | 40.738400 | -73.985786 | 40.732815 | N | 255 |
| 9 | id0012891 | 2 | 2016-03-10 21:45:01 | 2016-03-10 22:05:26 | 1 | -73.981049 | 40.744339 | -73.973000 | 40.789989 | N | 1225 |
# Clean data process
duration_mask = ((data.trip_duration < 60) | # < 1 min
(data.trip_duration > 3600 * 2)) # > 2 hours
print('Anomalies in trip duration, %: {:.2f}'.format(data[duration_mask].shape[0] / data.shape[0] * 100))
data = data[~duration_mask]
data.trip_duration = data.trip_duration.astype(np.uint16)
print('Trip duration in seconds: {} to {}'.format(data.trip_duration.min(), data.trip_duration.max()))
print('Empty trips: {}'.format(data[data.passenger_count == 0].shape[0]))
data = data[data.passenger_count > 0]
Anomalies in trip duration, %: 0.74
Trip duration in seconds: 60 to 7191
Empty trips: 17
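The astype(np.uint16) cast above only works because the duration filter capped values at two hours; a small helper (hypothetical, not part of the notebook) can make that range check explicit before casting:

```python
import numpy as np
import pandas as pd

def downcast_checked(series, dtype):
    """Cast to a smaller integer dtype only if every value fits its range."""
    info = np.iinfo(dtype)
    if series.min() < info.min or series.max() > info.max:
        raise ValueError(f"values outside {np.dtype(dtype).name} range")
    return series.astype(dtype)

# Durations in seconds, after the 1-minute / 2-hour filter.
durations = pd.Series([60, 455, 7191])
compact = downcast_checked(durations, np.uint16)
```

The same check would raise on the unfiltered data if any duration exceeded 65535 seconds, so it documents exactly when the downcast is safe.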
# Function to summarize missing data
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the summary dataframe
    return mis_val_table_ren_columns
missingtrain = missing_values_table(train)
missingtrain
Your selected dataframe has 11 columns.
There are 0 columns that have missing values.
| Missing Values | % of Total Values |
|---|---|
df_train = df_train.dropna(how='any', axis=0)
missing = missing_values_table(df_train)
missing
Your selected dataframe has 8 columns.
There are 0 columns that have missing values.
| Missing Values | % of Total Values |
|---|---|
Result check: after cross-checking for missing columns, the data is entirely clean, with no missing values, and can be trusted. With a large dataset like this New York City data, we have to ensure there are as few null values as possible so that the next step works well, because we will be working with spatial data.
df_train
| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 2013-06-12 23:25:15.0000004 | 15.0 | 2013-06-12 23:25:15 UTC | -73.999973 | 40.748531 | -74.016899 | 40.705993 | 1 |
| 49996 | 2015-06-22 17:19:18.0000007 | 7.5 | 2015-06-22 17:19:18 UTC | -73.984756 | 40.768211 | -73.987366 | 40.760597 | 1 |
| 49997 | 2011-01-30 04:53:00.00000063 | 6.9 | 2011-01-30 04:53:00 UTC | -74.002698 | 40.739428 | -73.998108 | 40.759483 | 1 |
| 49998 | 2012-11-06 07:09:00.00000069 | 4.5 | 2012-11-06 07:09:00 UTC | -73.946062 | 40.777567 | -73.953450 | 40.779687 | 2 |
| 49999 | 2010-01-13 08:13:14.0000007 | 10.9 | 2010-01-13 08:13:14 UTC | -73.932603 | 40.763805 | -73.932603 | 40.763805 | 1 |
50000 rows × 8 columns
## For df_train, add some important datetime features for time-series analysis.
df_train['pickup_datetime'] = pd.to_datetime(df_train['pickup_datetime'])
df_train['pickup_datetime_month'] = df_train['pickup_datetime'].dt.month
df_train['pickup_datetime_year'] = df_train['pickup_datetime'].dt.year
df_train['pickup_datetime_day_of_week'] = df_train['pickup_datetime'].dt.weekday
df_train['pickup_datetime_day_of_hour'] = df_train['pickup_datetime'].dt.hour
df_trains = df_train
data.head(1)
# Check again the dataframe
| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | N | 455 |
correlation = np.corrcoef(df_train['fare_amount'], df_train['passenger_count'])
## The correlation of fare_amount with passenger_count.
df_train = df_train.drop(['passenger_count'], axis=1)
## Drop passenger_count.
df_train = df_train.drop(['pickup_datetime'], axis=1)
## Drop pickup_datetime.
df_train = df_train.drop(['key'], axis=1)
## Drop the key column.
# Features for data
data = data[data.passenger_count > 0]
#Convert this feature into categorical type
data.store_and_fwd_flag = data.store_and_fwd_flag.astype('category')
#month (pickup and dropoff)
data['mm_pickup'] = data.pickup_datetime.dt.month.astype(np.uint8)
data['mm_dropoff'] = data.dropoff_datetime.dt.month.astype(np.uint8)
#day of week
data['dow_pickup'] = data.pickup_datetime.dt.weekday.astype(np.uint8)
data['dow_dropoff'] = data.dropoff_datetime.dt.weekday.astype(np.uint8)
# day hour pickup and drop off
data['hh_pickup'] = data.pickup_datetime.dt.hour.astype(np.uint8)
data['hh_dropoff'] = data.dropoff_datetime.dt.hour.astype(np.uint8)
# Set the daily name.
dow_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
## the style of sns.
# sns.set(style="white")
# Work on a copy of the train dataframe
temp3 = train.copy()
# Compute the correlation matrix
corr = temp3.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 13))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
Positive (red): both variables move in the same direction; when one increases, the other increases.
Negative (light): the opposite relationship.
Among the correlations shown in the image, passenger_count is slightly positive, and dropoff_latitude and pickup_latitude are red, meaning they are positively correlated at about 0.30; they move almost together, so if either one increases, the other tends to increase as well.
Light colors indicate negative correlations between other features: when one variable's value increases, the other's decreases, as with vendor_id and pickup_latitude.
plt.figure(figsize=(12,2))
datas = data.groupby('dow_pickup').aggregate({'id':'count'}).reset_index()
sns.barplot(x='dow_pickup', y='id', data=datas, )
## PLOT
plt.title('Pick-Up Passenger Weekday Distribution')
plt.xlabel('Day of Week')
plt.xticks(range(0, 7), dow_names, rotation='horizontal')
plt.ylabel('No of Trips made')
This is the pickup distribution by weekday. Friday is the most popular day to hail a taxi, with close to 220,000 trips made, while Sunday is at the other end of the spectrum with approximately 165,000 trips.
plt.figure(figsize=(12,2))
datas1 = data.groupby('hh_pickup').aggregate({'id':'count'}).reset_index()
sns.barplot(x='hh_pickup', y='id', data=datas1)
plt.title('Pick-ups Hour Distribution')
plt.xlabel('Hour of Day, 0-23')
plt.ylabel('No of Trips made')
# plt.savefig('Figures/pickups-hour-distribution.png')
f = plt.figure(figsize=(10,8))
days = [i for i in range(7)]
sns.countplot(x='dow_pickup', data=data, hue='hh_pickup', alpha=0.8)
plt.xlabel('Day of the week', fontsize=14)
plt.ylabel('Pickup count', fontsize=14)
plt.xticks(days, ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
plt.legend(loc=(1.04,0))
plt.show()
Weekdays are fairly stable; the evening hours from about 19:00 to 23:00 are the densest, with more than 12,000 pickups per hour. Wednesday and Thursday see many departures at the pick-up points, while Sunday sees few.
Between 1 am and 5 am there are fewer departures. From 9 am to noon, the volume rises again to more than 800 pickups.
f = plt.figure(figsize=(11,6))
pass_count = train['passenger_count'].value_counts()
sns.barplot(x=pass_count.index, y=pass_count.values, alpha=0.8)
plt.xlabel('Number of passengers on a trip', fontsize=14)
plt.ylabel('Count of Customer', fontsize=14)
plt.title('Passenger count on each trip')
plt.show()
Reasonable fares run from about 0 to 200 dollars; this is the amount customers spend on each trip. The chart below helps us clearly see the trend and level of spending by specific hour.
It breaks down, for each hour, the spending distribution based on how dense the trips are.
plt.style.use('fivethirtyeight')
plt.figure(figsize = (10, 10))
for h, grouped in df_train.groupby('pickup_datetime_day_of_hour'):
sns.kdeplot(grouped['fare_amount'], label = f'{h} hour');
# Use kdeplot technique to plot this graph.
# Label per hours
plt.title('Fare Amount by Hour of Day');
plt.legend()
# Add the legend of each hour.
As you can see, where the density exceeds 0.10, the fares sit in the fairly affordable 0 to $30 range; that is the density peak. Around $40 to $50 there is still some density, between 0 and 0.01. Very few trips exceed $100, and most of those head away from the city toward the outskirts.
The dense, low-cost region consists mostly of short trips through locations in the center of New York.
The low-density, high-cost region occurs mostly at the airport, when customers shuttle back and forth between the center and the airport.
I use the time_slicer function to draw several graphs at once, because users want to see different time periods: week, month, and weekday.
def time_slicer(df, timeframes, value, color="purple"):
    """
    Function to count observation occurrence through different lenses of time.
    """
    f, ax = plt.subplots(len(timeframes), figsize=[10, 10])
    for i, x in enumerate(timeframes):
        df.loc[:, [x, value]].groupby([x]).mean().plot(ax=ax[i], color=color)
        ax[i].set_ylabel(value.replace("_", " ").title())
        ax[i].set_title("{} by {}".format(value.replace("_", " ").title(), x.replace("_", " ").title()))
        ax[i].set_xlabel("")
    ax[len(timeframes) - 1].set_xlabel("Time Frame")
    plt.tight_layout(pad=0)
plt.style.use('fivethirtyeight')
time_slicer(df=df_train, timeframes=["pickup_datetime_month", "pickup_datetime_year", "pickup_datetime_day_of_week", "pickup_datetime_day_of_hour"], value = "fare_amount", color="blue")
Following the simulations above, this graph details the fare amounts by month, year, day of week, and hour of day.
Month: a monthly view appeared earlier; I repeat it here to cross-check the graph.
Year: the amounts increased gradually from 2009 to a peak in 2014, which is notable; the fares keep getting higher.
Day: weekday rates are highest on Tuesdays and lowest on Thursdays; weekend rates are also high.
Hours: rates spike early in the morning, around 5 AM.
## Version 1: a more advanced heatmap.
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
import plotly.io as pio
# set the background of dark theme.
pio.templates.default = "plotly_dark"
# def HeatPlot(data, dow_names):
def HeatPlot1(data, dow_names):
    ## px.imshow draws the heatmap from the row-normalized crosstab.
    fig = px.imshow(pd.crosstab(data.dow_pickup, data.hh_pickup, values=data.vendor_id, aggfunc='count', normalize='index'),
                    labels=dict(x="Time of Day", y="Weekly"))
    # x is labelled by time of day, y by day of week.
    fig.update_layout(
        yaxis=dict(
            tickmode='array',
            tickvals=[0, 1, 2, 3, 4, 5, 6],
            ticktext=dow_names
        )
    )
    fig.update_xaxes(side="top")
    fig.show()
HeatPlot1(data, dow_names)
Looking at the heatmap for Saturday and Sunday, there are few pick-ups in the early morning from 5 am to 9 am. During weekdays, taxis start picking up passengers at about 7-8 am and stay busy into the afternoon. The highest density shows as bright yellow in the evening from 16:00 to 20:00, with a share above 0.05 on the heatmap.
On Sunday we can also see customers hailing cabs around midnight, again with a bright share above 0.05, from about midnight to 2-3 am.
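The heatmap above is driven by pd.crosstab(..., normalize='index'), which converts raw counts into per-row shares, so each weekday row sums to 1.0. A toy sketch of that behavior (the day/hour values below are made up for illustration):

```python
import pandas as pd

dow = pd.Series([0, 0, 0, 1, 1])     # pickup day of week
hour = pd.Series([8, 8, 17, 8, 17])  # pickup hour

# normalize='index' divides each row of counts by that row's total,
# turning counts into within-day shares.
shares = pd.crosstab(dow, hour, normalize="index")
```

On day 0 there are two 8 am pickups out of three trips, so its 8 am cell becomes 2/3; that per-row share is exactly what the heatmap colors encode.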
Going further, we will have a closer look at the trip distribution thanks to the heatmap.
df_plot = df_train.pivot_table('fare_amount', index='pickup_datetime_day_of_hour', columns='pickup_datetime_year')
sns.set_style("whitegrid", {'axes.grid' : False})
df_plot.plot(figsize=(14,6), cmap="plasma")
plt.ylabel('Fare $USD');
plt.xlabel('Hour of the days');
plt.style.use('fivethirtyeight');
Analysis:
We can see the hourly fares across 24 hours in different years; around 5 AM the fares are very high, and they rise a bit again after 3 PM. The morning reads as the rush hours.
2009: the cheapest year, with the highest fare at the time just over 12 dollars. At noon it is also quite affordable, from under 10 dollars to 12 dollars, between 10:00 a.m. and 3:00 p.m. on the x-axis.
2010: the trip fares across the 24 hours increase by a small amount. The average cost is neither too high nor too low. The morning remains the busiest time and carries the highest price customers pay, right when the working day starts in an energetic city like New York.
2011: still at a normal price level; the most stable and affordable rates for New York residents compared with the other years.
2012: costs become abnormally high. The peak price suddenly jumps from under 14 to over 18 dollars at the morning peak around 5 am.
2013: around 5 AM, costs reach their highest point, an unusual peak compared to previous years; then there is a rush-hour wave around 16:00, probably office workers heading home. The afternoon remains stable, with fares reaching over 16 dollars.
TRIP_DURATION is a skewed variable with a long tail. The graph below therefore normalizes the tail of the distribution, making it more balanced overall.
train['log_trip_duration'] = np.log1p(train['trip_duration'].values)
## Add log trip for trip duration.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
fig.suptitle('Train trip duration and log of trip duration')
ax1.set_ylabel('count')
ax1.set_xlabel('trip duration')
ax2.set_xlabel('log(trip duration)')
ax1.hist(train.trip_duration, color='black', bins=7)
ax2.hist(train.log_trip_duration, bins=50, color='Green');
Result of normalizing the trip log:
The log of trip_duration is approximately normally distributed, and we can also identify peaks. When this is used as the response variable in a numerical model, we need to remember to convert the logged trip durations back to their original scale.
There is a small outlier on the far right, but its impact is small: the maximum raw value is about 3.526282 × 10^6 seconds, which is roughly 41 days. Sometimes the driver left the meter's data recording running, and sometimes it was not switched on to charge.
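The note about converting logged durations back to the original scale can be sketched directly: np.expm1 is the exact inverse of np.log1p, so predictions made on the log scale can be mapped back to seconds.

```python
import numpy as np

durations = np.array([60.0, 455.0, 7191.0])  # trip_duration in seconds
logged = np.log1p(durations)                 # compress the long right tail
restored = np.expm1(logged)                  # exact inverse transform
```

log1p/expm1 are preferred over plain log/exp here because they stay numerically accurate near zero.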
print('Old size: %d' % len(df_train))
df_train = df_train[df_train.fare_amount>=0]
print('New size: %d' % len(df_train))
Old size: 50000 New size: 49994
We then filtered out fares below $0. I do not know the exact cause, but I guess some trips were simply not charged.
df_train[df_train.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3), color="blue")
plt.xlabel('fare $USD')
plt.title('Histogram');
Then I also examine the distribution of fares under a hundred dollars. The most common fares, appearing as many as 6,000 times, sit in the range around 5 dollars.
From there the distribution tails off to the right.
## Helper: haversine distance in miles between two (lat, lon) points.
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))  # 2*R*asin..., converted from km to miles
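As a sanity check of this distance helper, the straight-line distance between two well-known NYC points should come out near 13 miles. The coordinates below are approximate values assumed for illustration, not taken from the dataset.

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    # Haversine formula; 12742 km is Earth's diameter, 0.6213712 converts km to miles.
    p = 0.017453292519943295  # Pi/180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# Approximate coordinates (assumed for illustration).
times_square = (40.7580, -73.9855)
jfk = (40.6413, -73.7781)
d = distance(times_square[0], times_square[1], jfk[0], jfk[1])
```

A result in the low teens of miles matches the "$50-$60 fare near 13 miles from the center" pattern discussed later for airport trips.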
# First calculate two arrays with datapoint density per sq mile
n_lon, n_lat = 200, 200 # number of grid bins per longitude, latitude dimension
density_pickup, density_dropoff = np.zeros((n_lat, n_lon)), np.zeros((n_lat, n_lon)) # prepare arrays
## BB is the bounding box (lon_min, lon_max, lat_min, lat_max) covering the New York area.
BB = (-74.5, -72.8, 40.5, 41.8)
# To calculate the number of datapoints in a grid area, the numpy.digitize() function is used.
# This function needs an array with the (location) bins for counting the number of datapoints
# per bin.
bins_lon = np.zeros(n_lon+1)  # longitude bin edges
bins_lat = np.zeros(n_lat+1)  # latitude bin edges
delta_lon = (BB[1]-BB[0]) / n_lon  # bin longitude width
delta_lat = (BB[3]-BB[2]) / n_lat  # bin latitude height
bin_width_miles = distance(BB[2], BB[1], BB[2], BB[0]) / n_lon  # bin width in miles
bin_height_miles = distance(BB[3], BB[0], BB[2], BB[0]) / n_lat  # bin height in miles
for i in range(n_lon+1):
    bins_lon[i] = BB[0] + i * delta_lon
for j in range(n_lat+1):
    bins_lat[j] = BB[2] + j * delta_lat
# Digitize per longitude, latitude dimension
inds_pickup_lon = np.digitize(df_train.pickup_longitude, bins_lon)
inds_pickup_lat = np.digitize(df_train.pickup_latitude, bins_lat)
inds_dropoff_lon = np.digitize(df_train.dropoff_longitude, bins_lon)
inds_dropoff_lat = np.digitize(df_train.dropoff_latitude, bins_lat)
# Count per grid bin
# note: as the density_pickup will be displayed as image, the first index is the y-direction,
# the second index is the x-direction. Also, the y-direction needs to be reversed for
# properly displaying (therefore the (n_lat-j) term)
dxdy = bin_width_miles * bin_height_miles
for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j, i] = np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j))) / dxdy
        density_dropoff[j, i] = np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j))) / dxdy
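The nested counting loop above can also be expressed, as a sketch, with a single np.histogram2d call, which counts points per grid cell in one vectorized pass. The random points below are synthetic stand-ins for the pickup coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
BB = (-74.5, -72.8, 40.5, 41.8)  # (lon_min, lon_max, lat_min, lat_max)
lon = rng.uniform(BB[0], BB[1], 1000)
lat = rng.uniform(BB[2], BB[3], 1000)

# One call counts points per cell; passing (lat, lon) makes rows the y-direction.
counts, lat_edges, lon_edges = np.histogram2d(
    lat, lon, bins=(200, 200), range=[[BB[2], BB[3]], [BB[0], BB[1]]])

# Flip vertically so row 0 is the northernmost bin, matching image display.
density_img = counts[::-1, :]
```

Dividing `counts` by the per-cell area in square miles would reproduce the density arrays computed by the loop.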
The plot of fares under $100 above focuses on understanding how the trip fares are distributed. As we can see, the highest bar exceeds 6,000 trips.
df_train['distance_miles'] = distance(df_train.pickup_latitude, df_train.pickup_longitude, \
df_train.dropoff_latitude, df_train.dropoff_longitude)
df_train['fare_per_mile'] = df_train.fare_amount / df_train.distance_miles
# scatter plot distance - fare we only see the total one.
fig, axs = plt.subplots(1, 2, figsize=(16,6))
axs[0].scatter(df_train.distance_miles, df_train.fare_amount, alpha=0.2, color='blue')
#choose the scatter plot type.
axs[0].set_xlabel('Distance Mile')
# set the label of x_axis
axs[0].set_ylabel('Fare $USD')
# set the label of y_axis
axs[0].set_title('All data in big scope')
# set the title of left plot
# zoom in on part of data into the small
idx = (df_train.distance_miles < 15) & (df_train.fare_amount < 100)
# plot the distance smaller than 15 mile.
axs[1].scatter(df_train[idx].distance_miles, df_train[idx].fare_amount, alpha=0.2, color='green')
axs[1].set_xlabel('distance mile')
axs[1].set_ylabel('fare $USD')
axs[1].set_title('Zoom in on distance < 15 mile, fare < $100');
# why less than 15 mile:
Blue: the panorama of trip distances, from 0 to more than 5,000 miles, covering the center out to the edge of the city and the surrounding suburbs. A very dense dark-blue band sits at small distances, from 0 to 15 miles. The few points on the right side of the picture, beyond 5,000 miles, are likely erroneous records rather than real out-of-town trips.
Green: a lot of dark green around 0 miles, i.e. the city center. Fares range from low to high, with the majority below the 60-dollar mark. The widest and densest band runs from 0 to 6 miles, with some trips from 6 to more than 14 miles.
Reason for the 15-mile cut:
Because the density of trips within a 15-mile radius is so great, we wanted to see it more clearly. As the zoomed graph shows, fares below 100 dollars occur with strong density and frequency, so a large share of completed taxi rides falls in that range.
nyc = (-74.0063889, 40.7141667)
### Location limited in NYC map
df_train['distance_to_center'] = distance(nyc[1], nyc[0], df_train.pickup_latitude, df_train.pickup_longitude)
### Use the distance function to compute miles from the center.
fig, axs = plt.subplots(1, 2, figsize=(19, 15))
im = axs[0].scatter(df_train.distance_to_center, df_train.distance_miles, c=np.clip(df_train.fare_amount, 0, 100),
cmap='plasma', alpha=1.0, s=1)
### Plot the scatter with alpha = 1.0 and color.
axs[0].set_xlabel('pickup distance from NYC center')
### xlabel name
axs[0].set_ylabel('distance miles')
### ylabel name
axs[0].set_title('All data')
### set the data name.
cbar = fig.colorbar(im, ax=axs[0])
cbar.ax.set_ylabel('fare_amount', rotation=270)
### Choose the trip to center < 15
idx = (df_train.distance_to_center < 15) & (df_train.distance_miles < 35)
im = axs[1].scatter(df_train[idx].distance_to_center, df_train[idx].distance_miles,
c=np.clip(df_train[idx].fare_amount, 0, 100), cmap='plasma', alpha=1.0, s=1)
axs[1].set_xlabel('pickup distance from NYC center')
axs[1].set_ylabel('distance miles')
axs[1].set_title('Zoom in')
cbar = fig.colorbar(im, ax=axs[1])
cbar.ax.set_ylabel('fare_amount', rotation=270)
Left picture: all the data, trip distance in miles against pickup distance from the NYC center.
Right picture: zoomed in to pickup distances under 15 miles from the NYC center (and trip distances under 35 miles).
The color bar on the right encodes the fare amount. There are a lot of purple-pink dots at about $50 to $60 fare near a 13-mile pickup distance from the NYC center. This could be due to trips from/to JFK airport.
The plot also shows how hustling and busy taxi trips are in the center: many dots concentrate between 0 and 6 on the x-axis of the right picture, i.e. short trips with short pickup distances, located in the center of New York.
Further below you can see how busy NYC taxi trips are, shown in the next plot.
Correlation between travel costs and direction.
At this point I am interested in the direction of the trips, which, alongside distance, is a useful feature for predicting future prices. To describe it specifically, I made this plot of costs based on the longitude and latitude of trips in NYC; the color bar shows how expensive each trip is, from pick-up point to drop-off point.
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])
## Helper used to restrict both dataframes to a bounding box.
The select_within_boundingbox function selects pickup and dropoff points by longitude and latitude; it could be packaged outside the notebook, but I keep it here. After choosing a latitude/longitude box that fits, the trips inside it are aggregated, which keeps the chart below clear and makes the trips apparent.
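A tiny worked example of select_within_boundingbox on made-up coordinates: the second trip's pickup longitude falls east of the Manhattan box, so it is excluded.

```python
import pandas as pd

def select_within_boundingbox(df, BB):
    # BB = (lon_min, lon_max, lat_min, lat_max); keep trips whose pickup AND
    # dropoff both fall inside the box.
    return ((df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) &
            (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) &
            (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) &
            (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3]))

# Hypothetical trips: the first stays inside the box, the second starts outside.
trips = pd.DataFrame({
    "pickup_longitude":  [-73.98, -73.80],
    "pickup_latitude":   [40.75, 40.64],
    "dropoff_longitude": [-73.99, -73.97],
    "dropoff_latitude":  [40.73, 40.75],
})
BB_manhattan = (-74.025, -73.925, 40.7, 40.8)
mask = select_within_boundingbox(trips, BB_manhattan)
```

The returned boolean mask is what the plot below indexes df_train with.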
df_train['delta_lon'] = df_train.pickup_longitude - df_train.dropoff_longitude
df_train['delta_lat'] = df_train.pickup_latitude - df_train.dropoff_latitude
# from planar import BoundingBox
# Select trips in Manhattan
BB_manhattan = (-74.025, -73.925, 40.7, 40.8)
idx_manhattan = select_within_boundingbox(df_train, BB_manhattan)
plt.figure(figsize=(14,8))
plt.scatter(df_train[idx_manhattan].delta_lon, df_train[idx_manhattan].delta_lat, s=0.7, alpha=1.0,
c=np.log1p(df_train[idx_manhattan].fare_amount), cmap='inferno')
plt.colorbar()
## Add the color heat bar on the right-hand side.
plt.xlabel('Pickup_longitude - dropoff_longitude')
plt.ylabel('Pickup_latitude - dropoff_latitude')
plt.title('Fare_amount map observe');
This heat map shows the log fare amount on a scale from 0 to about 4.
Purple: the fare amount is on the lower side, not too high.
Orange + yellow: the fare amount is high, because the taxi trip travels far from the center; some trips leave the central area of this graph. The highest density of trips concentrates near zero in both the latitude and longitude deltas. We can see trips focused between roughly -0.025 and 0.025 in longitude delta, meaning many customers booked taxis within the center, mainly the Manhattan district.
Outside that, in orange, are a few tiny dots from -0.075 to 0.075 in longitude and latitude delta; these are far from the center, for example heading out toward JFK (John F. Kennedy) Airport.
Vendors occupy a large part of the current dataset in New York. There are two types, vendor 1 and vendor 2, so we analyze the number of customers per vendor with a violin plot.
I use Seaborn specifically to plot it.
import seaborn as sns
sns.set(style="whitegrid", palette="pastel", color_codes=True)
sns.set_context("poster")
train2 = train.copy()
# Copy the train dataframe and name it train2.
train2['trip_duration']= np.log(train2['trip_duration'])
sns.violinplot(x="passenger_count", y="trip_duration", hue="vendor_id", data=train2, split=True,
inner="quart",palette={1: "g", 2: "r"})
# plot the violin plot.
sns.despine(left=True)
sns.set(rc={'figure.figsize':(15,6)})
print(train2.shape[0])
500000
Vendor_id: we have two vendors in the dataset, vendor 1 and vendor 2.
For both vendor categories, trips with a passenger count of zero have a very long, even distribution tail, extending even below zero on the log scale.
Most trips have a log trip duration between 5.0 and 7.0, carry about 1 to 6 passengers, and occur regularly. Trips with 7 to 9 passengers are very rare.
This part follows a reference approach using SciPy helpers that lets us zoom in and plot the pickup and dropoff data points. We can see where passengers start their trips, and that central Manhattan is extremely busy.
taxiDB = data.copy()
allLat = np.array(list(taxiDB['pickup_latitude']) + list(taxiDB['dropoff_latitude']))
allLong = np.array(list(taxiDB['pickup_longitude']) + list(taxiDB['dropoff_longitude']))
## Collect all latitudes and longitudes from the data.
longLimits = [np.percentile(allLong, 0.3), np.percentile(allLong, 99.7)]
latLimits = [np.percentile(allLat , 0.3), np.percentile(allLat , 99.7)]
durLimits = [np.percentile(taxiDB['trip_duration'], 0.4), np.percentile(taxiDB['trip_duration'], 99.7)]
# Calculate percentile limits for trip_duration.
taxiDB = taxiDB[(taxiDB['pickup_latitude'] >= latLimits[0] ) & (taxiDB['pickup_latitude'] <= latLimits[1]) ]
taxiDB = taxiDB[(taxiDB['dropoff_latitude'] >= latLimits[0] ) & (taxiDB['dropoff_latitude'] <= latLimits[1]) ]
taxiDB = taxiDB[(taxiDB['pickup_longitude'] >= longLimits[0]) & (taxiDB['pickup_longitude'] <= longLimits[1])]
taxiDB = taxiDB[(taxiDB['dropoff_longitude'] >= longLimits[0]) & (taxiDB['dropoff_longitude'] <= longLimits[1])]
taxiDB = taxiDB[(taxiDB['trip_duration'] >= durLimits[0] ) & (taxiDB['trip_duration'] <= durLimits[1]) ]
taxiDB = taxiDB.reset_index(drop=True)
# Recompute the coordinate arrays after filtering.
allLat = np.array(list(taxiDB['pickup_latitude']) + list(taxiDB['dropoff_latitude']))
allLong = np.array(list(taxiDB['pickup_longitude']) + list(taxiDB['dropoff_longitude']))
# convert fields to sensible units
medianLat = np.percentile(allLat,50)
medianLong = np.percentile(allLong,50)
latMultiplier = 111.32
longMultiplier = np.cos(medianLat*(np.pi/180.0)) * 111.32
taxiDB['duration [min]'] = taxiDB['trip_duration']/60.0
taxiDB['src lat [km]'] = latMultiplier * (taxiDB['pickup_latitude'] - medianLat)
taxiDB['src long [km]'] = longMultiplier * (taxiDB['pickup_longitude'] - medianLong)
taxiDB['dst lat [km]'] = latMultiplier * (taxiDB['dropoff_latitude'] - medianLat)
taxiDB['dst long [km]'] = longMultiplier * (taxiDB['dropoff_longitude'] - medianLong)
allLat = np.array(list(taxiDB['src lat [km]']) + list(taxiDB['dst lat [km]']))
allLong = np.array(list(taxiDB['src long [km]']) + list(taxiDB['dst long [km]']))
# show the log density of pickup and dropoff locations
imageSize = (700,700)
longRange = [-5,19]
latRange = [-13,11]
allLatInds = imageSize[0] - (imageSize[0] * (allLat - latRange[0]) / (latRange[1] - latRange[0]) ).astype(int)
allLongInds = (imageSize[1] * (allLong - longRange[0]) / (longRange[1] - longRange[0])).astype(int)
locationDensityImage = np.zeros(imageSize)
for latInd, longInd in zip(allLatInds,allLongInds):
locationDensityImage[latInd,longInd] += 1
fig, ax = plt.subplots(nrows=1,ncols=1,figsize=(9,9))
ax.imshow(np.log(locationDensityImage+1),cmap='hot')
fig.tight_layout()
ax.set_axis_off()
Interpreting the graph
The graph above shows the density of customer pickup and dropoff locations.
The yellow pixels mark high-density points, letting us distinguish buildings from the centre out to the suburbs. New York is a very big city, which is why we should look at the rides from this perspective.
The purpose of this map is to be a starting point for analyzing customer trips and predicting the density of future ones.
Anticipating the areas customers frequent gives the most realistic results.
Next I will split it into two minimaps, one for pickup locations and one for dropoff locations. Predicting the density of where customers are headed helps companies such as Uber and Lyft see where they will be busiest, with the aim of increasing revenue and optimizing their rides and service plans.
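The pickup/dropoff split described above can be sketched by reusing the same binning once per endpoint. This is a minimal sketch, assuming the kilometre-offset columns built earlier (`src lat [km]` etc.); synthetic arrays stand in for them here, and np.histogram2d replaces the manual index arithmetic:

```python
import numpy as np
import matplotlib.pyplot as plt

def density_image(lat_km, long_km, lat_range=(-13, 11), long_range=(-5, 19), size=700):
    """Bin kilometre-offset coordinates into a log-density image,
    mirroring the single combined image built above."""
    img, _, _ = np.histogram2d(lat_km, long_km, bins=size,
                               range=[lat_range, long_range])
    return np.log(img + 1)

# Synthetic stand-ins for taxiDB's 'src ... [km]' and 'dst ... [km]' columns:
rng = np.random.default_rng(1)
src = rng.normal(0, 2, (50_000, 2))
dst = rng.normal(1, 3, (50_000, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.imshow(density_image(src[:, 0], src[:, 1]), cmap='hot'); ax1.set_title('Pickups')
ax2.imshow(density_image(dst[:, 0], dst[:, 1]), cmap='hot'); ax2.set_title('Dropoffs')
for ax in (ax1, ax2):
    ax.set_axis_off()
```

Note that np.histogram2d does not flip the vertical axis the way the manual indexing above does, so north/south orientation may differ between the two renderings.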
Each trip has five main attributes: the pickup and dropoff coordinates and the trip duration. Let's cluster all trips into 80 stereotypical template trips; we can then look at the distribution of trips across clusters and find out how it changes over time.
tripAttributes = np.array(taxiDB.loc[:,['src lat [km]','src long [km]','dst lat [km]','dst long [km]','duration [min]']])
meanTripAttr = tripAttributes.mean(axis=0)
stdTripAttr = tripAttributes.std(axis=0)
tripAttributes = stats.zscore(tripAttributes, axis=0)
## Cluster the trips into 80 stereotypical templates.
numClusters = 80
## MiniBatchKMeans from scikit-learn handles this many samples efficiently.
from sklearn import cluster
TripKmeansModel = cluster.MiniBatchKMeans(n_clusters=numClusters, batch_size=120000, n_init=100, random_state=1)
clusterInds = TripKmeansModel.fit_predict(tripAttributes)
## fit_predict returns the cluster index assigned to each trip.
clusterTotalCounts, _ = np.histogram(clusterInds, bins=numClusters)
sortedClusterInds = np.flipud(np.argsort(clusterTotalCounts))
plt.figure(figsize=(12,4)); plt.title('Cluster histogram of all trips')
plt.bar(range(1,numClusters+1),clusterTotalCounts[sortedClusterInds])
plt.ylabel('Frequency [counts]'); plt.xlabel('Cluster index (sorted by cluster frequency)')
plt.xlim(0,numClusters+1)
(0.0, 81.0)
The cluster histogram shows how trip frequency is distributed across the 80 clusters. The first ten clusters each contain more than 50,000 trips, counts fall to around 30,000 by cluster index 20, and the least frequent clusters, covering trips far from the city, drop below 1,000 by the 80th. In other words, trips are heavily concentrated in the centre.
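One quick way to quantify the skew the histogram shows is the share of all trips covered by the top-k clusters. This is a small illustrative helper, not part of the original analysis; synthetic counts with the same general shape stand in for clusterTotalCounts:

```python
import numpy as np

def top_k_share(counts, k):
    """Fraction of all trips that fall into the k most frequent clusters."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]
    return counts[:k].sum() / counts.sum()

# Synthetic counts shaped like the histogram above:
# a head near 50,000 trips decaying toward roughly 1,000.
counts = np.linspace(50_000, 1_000, 80)
print(f"top 10 clusters cover {top_k_share(counts, 10):.0%} of trips")
```

With a uniform distribution the top 10 of 80 clusters would cover 12.5%; anything well above that confirms the concentration in the centre.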
I was also keen to build a similar map simulating the rides and movements in Tableau, using its Map Layer feature. It is quite time consuming to build, but it helps a company work with near-real-time data; I attach a picture of the Tableau dashboard below.
#%% show the template trips on the map
def ConvertToImageCoords(latCoord, longCoord, latRange, longRange, imageSize):
latInds = imageSize[0] - (imageSize[0] * (latCoord - latRange[0]) / (latRange[1] - latRange[0]) ).astype(int)
longInds = (imageSize[1] * (longCoord - longRange[0]) / (longRange[1] - longRange[0])).astype(int)
return latInds, longInds
## Convert latitude/longitude coordinates into pixel indices for the density image.
templateTrips = TripKmeansModel.cluster_centers_ * np.tile(stdTripAttr,(numClusters,1)) + np.tile(meanTripAttr,(numClusters,1))
## Undo the z-scoring to recover the template trips in km/min units.
srcCoords = templateTrips[:,:2]   # pickup coordinates
dstCoords = templateTrips[:,2:4]  # dropoff coordinates
## Extract the coordinates of the template trips.
srcImCoords = ConvertToImageCoords(srcCoords[:,0],srcCoords[:,1], latRange, longRange, imageSize)
dstImCoords = ConvertToImageCoords(dstCoords[:,0],dstCoords[:,1], latRange, longRange, imageSize)
plt.rcParams['axes.grid'] = False
plt.figure(figsize=(14,14))
plt.imshow(np.log(locationDensityImage+1),cmap='hot')
plt.scatter(srcImCoords[1],srcImCoords[0],c='blue',s=200,alpha=0.8)
## set the blue dot for pickup_point
plt.scatter(dstImCoords[1],dstImCoords[0],c='yellow',s=200,alpha=0.8)
## set the yellow scatter dot for drop_off point.
## Draw an arrow from each pickup point to its dropoff point.
for i in range(len(srcImCoords[0])):
plt.arrow(srcImCoords[1][i],srcImCoords[0][i], dstImCoords[1][i]-srcImCoords[1][i], dstImCoords[0][i]-srcImCoords[0][i],
edgecolor='c', facecolor='c',width=0.8,alpha=0.8,head_width=14.0,head_length=9.0,length_includes_head=True)
## Each arrow runs from a point in srcImCoords to the matching point in dstImCoords.
## edgecolor sets the colour of the arrow outline.
## width of the arrow shaft is 0.8
## alpha sets the transparency
This is a bird's-eye view, as if from a helicopter, of how taxis move back and forth across the city.
Yellow dots: customer drop-off points.
Blue dots: customer pick-up points.
Light blue lines: arrows showing the direction customers travel.
The majority of customers move within the heart of NYC; others travel out of the city to the airport. There are always rides with trip distances over 20 km, appearing in the suburbs and linked with the airport area, alongside rides under 10 km inside the city.
There are also small trips outside the airport and beyond the centre, in areas few customers visit, but with relatively low frequency.
The map has three especially busy spots: central Manhattan, JFK Airport at the bottom right, and the north-east edge.
Trips from John F. Kennedy Airport into the city vary widely, acting as a shuttle route.
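The over-20 km versus under-10 km claim can be checked directly from the template coordinates, since srcCoords and dstCoords are already kilometre offsets from the median point, so Euclidean distance on them approximates the straight-line trip displacement. A small sketch with synthetic stand-ins for the cluster centres:

```python
import numpy as np

def template_distances(srcCoords, dstCoords):
    """Straight-line length of each template trip, in km."""
    return np.linalg.norm(dstCoords - srcCoords, axis=1)

# Synthetic stand-ins for the (n_clusters, 2) km-offset coordinate arrays:
src = np.array([[0.0, 0.0], [1.0, 1.0], [-2.0, 3.0]])
dst = np.array([[0.0, 5.0], [13.0, 17.0], [-2.0, 3.0]])
print(template_distances(src, dst))  # [ 5. 20.  0.]
```

Run on the real templateTrips, a histogram of these distances would separate the short in-city clusters from the long airport shuttles.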
## Build the pickup scatter data (latitude, longitude).
data = [go.Scattermapbox(
    lat=df_train5k['pickup_latitude'],   # pickup latitude column
    lon=df_train5k['pickup_longitude'],  # pickup longitude column
    mode='markers',                      # scatter style
    marker=dict(
        size=4,          # marker size
        color='white',   # marker colour
        opacity=.8,      # marker transparency
    ),
)]
## set the layout behind the map.
layout = go.Layout(autosize=False,
mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
bearing=10,
pitch=60,
zoom=13,
center= dict(
lat=40.721319,
lon=-73.987130),
style= "mapbox://styles/shaz13/cjiog1iqa1vkd2soeu5eocy4i"),
width=900,
height=600, title = "Pick up Locations in NewYork")
fig = dict(data=data, layout=layout)
offline.iplot(fig)
The pick-up points are described in detail in this 3D chart, plotted directly over the street map. The intent is to show the trips more clearly against the buildings and the complexity of the road network, because this view helps us see, and confirm, where passengers are picked up.
# Get the data and load them into Scattermapbox.
data = [go.Scattermapbox(
lat= df_train5k['dropoff_latitude'] ,
lon= df_train5k['dropoff_longitude'],
mode='markers',
marker=dict(
size= 4,
color = 'cyan',
opacity = .8,
),
)]
## Setting the layout for map.
layout = go.Layout(autosize=False,
mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
bearing=10,
pitch=60,
zoom=13,
center= dict(
lat=40.721319,
lon=-73.987130),
style= "mapbox://styles/shaz13/cjk4wlc1s02bm2smsqd7qtjhs"),
width=900,
height=600, title = "Drop off locations in Newyork")
fig = dict(data=data, layout=layout)
offline.iplot(fig)
Reason: this map is used to confirm clear, easily identifiable drop-off points.
Cyan points mark the drop-off locations against the red road map; when there are many points, the roads help the eye distinguish them.
You can pan and zoom the map; it is a very user-friendly and eye-catching visualization.
# Restrict to a bounding box so the dense data reads clearly.
west, south, east, north = -74.03, 40.63, -73.77, 40.85
## Coordinate limits for NYC.
train = train[(train.pickup_latitude> south) & (train.pickup_latitude < north)]
train = train[(train.dropoff_latitude> south) & (train.dropoff_latitude < north)]
train = train[(train.pickup_longitude> west) & (train.pickup_longitude < east)]
train = train[(train.dropoff_longitude> west) & (train.dropoff_longitude < east)]
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(15,10))
ax1.grid(False)
ax2.grid(False)
train.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',
color='lightgreen',
s=.02, alpha=.6, subplots=True, ax=ax1, grid=False)
ax1.set_title("Pickups Point")
ax1.set_facecolor('black')
train.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',
color='lightblue',
s=.02, alpha=.6, subplots=True, ax=ax2, grid=False)
ax2.set_title("Dropoffs Point")
ax2.set_facecolor('black')
Having analyzed each time of day and each month, we can now plot a map of the density of pickup and dropoff points.
Light green marks the pickup points (left panel). Pickups are quite dense in the centre and spread a little beyond it; at the airport, pickup density is also high and spreads evenly along the route back from the airport.
Light blue marks the dropoff points (right panel), which appear even denser than the pickups. Density at the airport is also high, and beyond the edge of the centre the dropoffs spread out, especially towards the lower-left edge of the right-hand map.
Both panels give us a concrete view of trip density in NYC. We can then use algorithms to predict customers' future drop-off points and the density of their trips.
To summarize the experience from the many angles of this NYC data analysis:
Fare_amount perspective: we can observe costs by day, night, month, and year. This gives us a concrete view of trip density, from which a Data Scientist can predict hourly trip density in the future and identify which features most affect the trips. Forecasting the number of future trips is intended to help a driving service improve its cost-effectiveness.
Map perspective: as a Data Analyst student I explored plotting densities with several specific mapping techniques. Having previously worked on data projects with company clients, I am keen to keep improving my data skills; an attractive map draws readers into the specifics. The map work is demonstrated through plotting techniques such as heat maps, scatter plots, and arrow plots.

Tableau gives a better view of New York City from the air, and shows in detail how the graphs relate time to cost, with clear breakdowns for each crowded area in NYC. A Tableau dashboard that encapsulates multiple graphs gives the company a holistic view.
**END**